User interface for context labeling of multimedia items

ABSTRACT

In certain embodiments, a neural network may be trained to associate context information with multimedia items. In some embodiments, context predictions for multimedia items may be obtained via a neural network. A first multimedia item and a first task related to a first context prediction for the first multimedia item may be presented on a user interface. A user response to the first task may be obtained via the user interface. Based on the user response to the first task, prediction feedback related to the first context prediction or the first multimedia item may be provided to the neural network to cause the neural network to be updated based on the prediction feedback.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/176,413, filed on Feb. 16, 2021, which is a continuation of U.S. patent application Ser. No. 16/251,317, filed Jan. 18, 2019, which is a continuation of U.S. patent application Ser. No. 15/002,248, filed Jan. 20, 2016, which claims the priority benefit of U.S. Provisional Patent Application No. 62/106,648, filed Jan. 22, 2015, each of which is incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure relates generally to organizing a plurality of multimedia items stored in a repository of multimedia items using automatically generated labels.

BACKGROUND

Association of context information with multimedia items may allow for an efficient organization of the multimedia items. By way of non-limiting example, users may be able to search for multimedia items stored in a repository of multimedia items based on the context information associated therewith. Context information may be stored as metadata associated with the multimedia items.

SUMMARY

One or more aspects of the disclosure relate to a system for associating context information with individual ones of multimedia items stored in a repository of one or more multimedia items. A given multimedia item may include one or more of an image, a video, audio, a text file, combinations thereof, and/or other considerations. The context information associated with individual ones of the multimedia items may be referred to as “labels,” “tags,” and/or other terms. An association of context information to individual ones of the multimedia items may be referred to as “labeling,” “tagging,” and/or other terms. Labels and/or other information associated with a multimedia item may be stored as metadata associated with the multimedia item. In some implementations, a label and/or other metadata may include one or more of a category, a geolocation, a timestamp, a price, a semantic description, a content description, a rating, and/or other information associated with a given multimedia item that may provide context for the multimedia item.

In some implementations, one or more components of the system may be configured such that labels and/or other information associated with a multimedia item may be initially and/or automatically associated with the multimedia item based on a preprocessing of the multimedia items. The initially associated labels may be predictions of labels that may be associated with the multimedia item. The preprocessing may include one or more of automatically clustering the multimedia items into groups (e.g., grouping one or more multimedia items together based on similarity between the labels of the multimedia items and/or semantic similarity), and/or other operations. In some implementations, a preprocessing of the multimedia items may facilitate a technique for organizing the multimedia items for presentation to one or more users in one or more user interfaces. In some implementations, a user may be presented with a set of multimedia items based on the automatic clustering of the multimedia items achieved in a preprocessing step. A user may carry out one or more labeling tasks for adding a label, removing a label, changing a predicted label, confirming a predicted label, and/or providing other input. In some implementations, a user may add, remove, change, and/or confirm the predicted labels of the set of multimedia content all at once, rather than for individual ones of the multimedia items one at a time.

By way of non-limiting example, types of labeling tasks determined to be carried out by one or more users may correspond to one or more of labeling items with a fixed vocabulary; labeling items with an open-ended vocabulary; labeling items with one label per item; labeling items with multiple labels per item; labeling regions, ranges, and/or parts of an item (e.g., bounding boxes of an image, time ranges of video or audio, and/or other considerations); labeling relationships between individual items (e.g., within individual groupings); labeling relationships between groups of items; ranking items by quality, preference, relevance, and/or other information; and/or other tasks.

In some implementations, one or more aspects of the disclosure presented herein may improve the efficiency with which one or more users add, remove, change, confirm, and/or provide other input related to one or more label predictions. Thus, the predicted labels and/or other metadata may provide an initial organization of the multimedia items for presentation to the user on a user interface. One or more aspects of organization facilitated herein may leverage principles of cognitive and/or perceptual psychology.

In some implementations, one or more aspects of the disclosure may utilize principles derived from methodologies of human cognition and perception to increase labeler efficiency in several ways. By way of non-limiting example, one or more aspects of the disclosure may achieve improved labeling efficiency by: reducing cognitive load by minimizing context switches; labeling many images with one label before changing an association of the labels; grouping similar multimedia items together to form repeating patterns; sorting objects by confidence of the prediction so that the user can quickly skip over regions of high homogeneity/consistency, since it may be easier to detect overall trends over groups of items rather than individual items; offering binary decisions where possible, rather than asking a user to choose from multiple options; asking a user to make decisions based on a set of images together, rather than one image at a time; asking questions whose answers are immediately obvious from the provided data, rather than involving additional context or thought; as well as other advantages that may become apparent to one skilled in the art upon a full reading of this disclosure. The system may facilitate labeling decisions that are fast, objective, and/or accurate representations of the associated multimedia items. By way of non-limiting example, human perceptual systems may be well optimized for spotting differences/disruptions in repeating patterns, rather than looking for particular objects in an array of visually heterogeneous images.

In some implementations, one or more aspects of the disclosure may be used to generate examples of correct (e.g., up to some threshold confidence score) associations of context information with multimedia items to train a machine learning system that may be configured to label multimedia items automatically. By way of non-limiting example, providing a set or sets of multimedia items having metadata associated therewith with a high degree of confidence may provide a technique to “bootstrap” from a machine learning prediction model to subsequent iterations of the machine learning prediction model having increasing accuracy. The initial machine learning prediction model may be associated with a high degree of association errors but may be accurate enough to be used with one or more implementations where a user may subsequently add, remove, change, and/or confirm the initial (e.g., predicted) labeling and/or provide other input. Subsequent iterations of the machine learning prediction model may be trained on increasingly “clean” training data (e.g., based on user additions, removals, changes, and/or confirmations of predictions and/or other input), which generally produces better-performing machine learning models and may subsequently increase labeling efficiency.

In some implementations, one or more aspects of the disclosure may be used to generate labels to be indexed by a search engine for searching a large repository of multimedia items.

In some implementations, one or more aspects of the disclosure may be used to moderate and/or filter multimedia content to remove/confirm items that don't match a set of labels or other criteria. This may include removing off-topic, offensive, and/or low-quality multimedia data, and/or other considerations.

In some implementations, the system may comprise one or more hardware processors configured to execute one or more computer program components. The computer program components may include a preprocessing component, a task component, a user interface (UI) component, a learning component, and/or other components. In some implementations, the computer program components may be housed in the same or different physical processor(s).

In some implementations, the preprocessing component may be configured to obtain one or more multimedia items from a repository of one or more multimedia items. Individual ones of the multimedia items may include one or more of an image, a video, audio, a text file, and/or other considerations. The preprocessing component may be configured to predict context information to be associated with individual ones of the multimedia items. In some implementations, the prediction may be based on a machine learning prediction model, and/or other information. The preprocessing component may be configured to determine confidence scores for individual ones of the predicted labels. A confidence score may correspond to a degree of certainty (or uncertainty) that the predicted labels accurately represent the associated multimedia item, and/or other considerations.

In some implementations, the preprocessing component may be configured to determine semantic representations for one or more multimedia items. By way of non-limiting example, a given semantic representation may be determined based on properties of the multimedia item, provided metadata and/or predicted metadata, and/or other information. In some implementations, the semantic representation may be used to determine semantic similarity between multimedia items. By way of non-limiting example, the semantic representation may comprise a discrete scalar value, a multi-dimensional quantity, a vector, a signal, a graphic representation, a text description, and/or other considerations. In some implementations, a semantic similarity score may be computed and/or determined from the semantic representations. In some implementations, pairwise similarity scores may be computed directly between pairs of multimedia items. Briefly, semantic representations and/or similarity scores may be used to cluster multimedia items into groups based on semantic similarity and/or to retrieve semantically similar items from a database.

In some implementations, the task component may be configured to generate one or more tasks and/or sets of tasks to be performed by one or more users via user interfaces. Tasks may correspond to presenting users with options for adding, removing, changing, and/or confirming predicted label associations within a user interface, and/or providing other input. The task component may be configured to select one or more labels or sets of labels from the labels predicted by the preprocessing component to correspond with individual ones of the tasks. By way of non-limiting example, a first label and/or a first set of labels from the labels predicted by the preprocessing component may be selected for a first task. In some implementations, the selection may be based on the confidence scores and/or other information. In some implementations, confidence scores may be used to prioritize the labels into a sequence of tasks that may be presented to a user (e.g., tasks associated with labels having low confidence scores may be prioritized over tasks associated with labels having high confidence scores, and/or other considerations).

In some implementations, the task component may be configured to select one or more multimedia items having predicted labels that match the labels selected by the task component. The task component may be configured to sort the selected multimedia items into a list based on one or more sorting metrics. A sorting metric may be based on one or more of time (e.g., most recent), priority, predicted confidence score, similarity to an exemplar, and/or other information.

In some implementations, the task component may be configured to associate multimedia items into item groups based on the semantic representations (e.g., determined similarity scores, and/or other information). The groups may be associated with group tasks that correspond to sets of homogeneous multimedia items that may reduce cognitive load on the user who subsequently adds, removes, changes, and/or confirms the predicted labels and/or provides other input related to the group.

The task component may be configured to associate individual multimedia items and/or item groups with individual tasks to be presented to one or more users via respective user interfaces. At some point in time, the task component may be configured to determine if consensus has been reached among users for labels associated with individual ones of the multimedia items and/or individual ones of the item groups. Once consensus has been determined, the task component may be configured to remove the associated tasks from the set of tasks to be completed by the one or more users.

The user interface component may be configured to effectuate presentation of a user interface to one or more users via user devices associated with the users. The user interface may be configured to display one or more of the obtained multimedia items, predicted context information associated with the one or more obtained multimedia items, and/or other information corresponding to one or more tasks and/or sets of tasks determined by the task component. The user interface component may be configured to obtain entry and/or selection of input from users via the user interface. The input may correspond to one or more of adding, removing, changing, and/or confirming a predicted label for an individual multimedia item and/or group of semantically similar multimedia items, and/or other input. In some implementations, the preprocessing component may be configured to update the predicted labels based on the user input.

The learning component may be configured to generate a new machine learning prediction model and/or update an existing machine learning prediction model based on the user input.

These and other features and characteristics of the present technology, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the disclosure. As used in the specification and in the claims, the singular forms of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a system for associating context information with multimedia items, in accordance with one or more implementations.

FIG. 2 illustrates a diagram of information flow of a system for associating context information with multimedia items, in accordance with one or more implementations.

FIG. 3 illustrates a diagram of information flow in accordance with preprocessing and/or task generation operations of a system for associating context information with multimedia items, in accordance with one or more implementations.

FIG. 4 illustrates an exemplary user interface configured to allow a user to carry out one or more labeling tasks, in accordance with one or more implementations.

FIG. 5 illustrates a method of associating context information with multimedia items, in accordance with one or more implementations.

FIG. 6 illustrates an exemplary user interface configured to allow a user to carry out one or more labeling tasks, in accordance with one or more implementations.

DETAILED DESCRIPTION

FIG. 1 illustrates a system 100 for associating context information with one or more multimedia items, in accordance with one or more implementations. In some implementations, the multimedia items may be obtained from a repository 107 of multimedia items, and/or from other locations. A given multimedia item may include one or more of an image, video, audio, a text file, combinations thereof, and/or other considerations. The context information associated with individual ones of the multimedia items may be referred to as “labels,” “tags,” and/or other terms. In some implementations, a label may generally refer to any information associated with a multimedia item. By way of non-limiting example, labels may include one or more of a category, a geolocation, a timestamp, pricing information, ratings, a semantic description, a content description, and/or other considerations. An association of context information to individual ones of the multimedia items may be referred to as “labeling,” “tagging,” and/or other terms. Labels and/or other information associated with a multimedia item may be stored as metadata associated with the multimedia item.

In some implementations, the association may be based on i) predicting associations of context information with individual ones of the multimedia items, ii) verifying the predictions by presenting the multimedia items and/or predictions to one or more users via one or more user interfaces, and/or other operations. In some implementations, label prediction (interchangeably called tag prediction, metadata prediction, and/or classification) may comprise using machine learning algorithms, object recognition, and/or other techniques to determine labels directly from the content of the multimedia items. In some implementations, verification may comprise one or more labeling tasks 221 carried out by one or more users. The verification provided by users may be used to update a machine learning algorithm and/or generate new machine learning algorithms and/or other technique(s) used for label prediction.

In some implementations, labeling tasks 221 may include one or more units of work for one or more users. A given task may be completed by a user with a set of actions. By way of non-limiting example, this may include providing entry and/or selection of elements displayed on a user interface to add a label, remove a predicted label, change a predicted label, confirm a predicted label, and/or other input. The input may be recorded as a labeling response.

In some implementations, user interfaces may be hosted over a communication network 106, such as the Internet. The user interface may be served by a server 102 to user devices 104 associated with users of the system 100. The server 102 may comprise one or more hardware processors 118 configured to execute one or more computer program components. The computer program components may include a preprocessing component 108, a task component 110, a user interface (UI) component 112, a learning component 114, and/or other components. Users may access the system 100 and/or user interface (not shown in FIG. 1) via user devices 104. User devices 104 may include, for example, a cellular telephone, a smartphone, a laptop, a tablet computer, a desktop computer, a television set-top box, a smart TV, a gaming console, and/or other devices as described herein and/or other considerations.

It is noted that in some implementations, some or all of the functionality of server 102 may be attributed to one or more user devices 104. In some implementations, the user devices 104 may include one or more hardware processors (not shown) configured to execute computer program components the same as or similar to components 108-114. For example, the user devices 104 may be configured to host the user interface (not shown in FIG. 1) locally based on information stored locally on the user devices 104. This implementation may be an “offline” implementation of the system 100, and/or other considerations. When user devices 104 run offline, one or more components executed by processors of the user devices 104 may gather information from tasks 221 carried out by the user and communicate information to the server 102 when an “online” connection may be made, and/or other considerations.

FIG. 2 illustrates a diagram of information flow of a system 200 for associating context information with multimedia items, in accordance with one or more implementations. System 200 may be the same as or similar to system 100 in FIG. 1. In some implementations, a preprocessing subsystem 210 (e.g., facilitated by preprocessing component 108 in FIG. 1) may produce context information and/or other information associated with one or more multimedia items obtained from a repository 207. The preprocessing subsystem 210 may facilitate organization of subsequent labeling tasks 221 to be carried out by one or more users, wherein the tasks 221 correspond to the associations of metadata to individual ones of the multimedia items and/or groups of multimedia items. The preprocessing subsystem 210 may be configured to use a first machine learning prediction model 211 to generate initial predictions of labels for individual ones of the multimedia items, an auxiliary training corpus 213 used to (at least initially) train the first machine learning prediction model 211, and/or other models and/or information. By way of non-limiting example, the auxiliary training corpus 213 may include one or more multimedia items being associated with one or more labels, wherein the associations are based on a high degree of confidence and/or an exact match (e.g., facilitated by manual user input). Information produced from preprocessing subsystem 210 may be stored in a content metadata store 212 and/or otherwise communicated directly to a task generator subsystem 220.

The auxiliary training corpus 213 of labeled examples may be used in several ways. By way of non-limiting example, the auxiliary training corpus 213 may be used to train the first machine learning prediction model 211; to “amplify” the labels in the auxiliary corpus 213; and/or other considerations. In some implementations, the labels in the auxiliary corpus may be amplified by propagating the training labels of the auxiliary corpus to similar items from the repository 207, as determined by semantic representations (see, e.g., semantic similarity component 320 in FIG. 3). By way of non-limiting example, for amplifying a given label X, one or more components of the system 200 may be configured to detect outliers in the repository 207 by finding items in the repository 207 that are labeled with X but are not similar to any item labeled X in the auxiliary corpus 213, as determined by semantic representations 321 and/or other information.
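
By way of illustration, the outlier check described in the preceding paragraph might be sketched as follows. This is a minimal sketch, not the disclosed implementation; the item dictionaries, the "labels" and "embedding" field names, and the unit-norm vector embeddings are all assumptions made for the example.

```python
import numpy as np

def find_label_outliers(label, repo_items, corpus_items, min_similarity=0.7):
    """Flag repository items labeled `label` that are not similar to any
    auxiliary-corpus item carrying that label. Hypothetical fields:
    "labels" is a set of strings, "embedding" a unit-norm vector."""
    corpus_vecs = np.array([item["embedding"] for item in corpus_items
                            if label in item["labels"]])
    if corpus_vecs.size == 0:
        return []  # no corpus examples of the label to compare against
    outliers = []
    for item in repo_items:
        if label not in item["labels"]:
            continue
        # Cosine similarity reduces to a dot product for unit-norm vectors.
        if float((corpus_vecs @ item["embedding"]).max()) < min_similarity:
            outliers.append(item)
    return outliers
```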

In some implementations, labels and/or other information may be associated with segments or portions of multimedia items. By way of non-limiting example, labels may be associated with spatial regions defined by bounding boxes or polygons, time segments associated with start and end ranges, and/or other considerations. Such labels may be computed from the first machine learning prediction model 211 in several ways. By way of non-limiting example, operations may include running the prediction model 211 over multiple crops, locations, portions, and/or segments of a given multimedia item; sorting the crops, locations, portions, and/or segments of the multimedia item by prediction confidence; and/or other considerations. In some implementations where the first machine learning prediction model 211 may be configured to produce localization metadata (including bounding boxes, time segmentations, and/or other considerations), initial crops or segments may be generated using the model directly.
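
One possible reading of the crop-and-sort operation is sketched below. The sliding-window geometry is illustrative only, and `predict` is a hypothetical callable returning a confidence in [0, 1] for an image crop, standing in for the prediction model.

```python
import numpy as np

def score_crops(image, predict, crop_size=224, stride=112):
    """Run a prediction model over sliding-window crops of an (H, W, C)
    image and return (top, left, size) regions sorted by confidence."""
    height, width = image.shape[:2]
    scored = []
    for top in range(0, max(height - crop_size, 0) + 1, stride):
        for left in range(0, max(width - crop_size, 0) + 1, stride):
            crop = image[top:top + crop_size, left:left + crop_size]
            scored.append(((top, left, crop_size), float(predict(crop))))
    # Highest-confidence crops first, mirroring the sorting step above.
    return sorted(scored, key=lambda entry: entry[1], reverse=True)

# Toy usage: mean brightness stands in for a real prediction model.
image = np.random.default_rng(0).random((448, 448, 3))
print(score_crops(image, predict=lambda crop: crop.mean())[:3])
```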

In some implementations, the task generator subsystem 220 (e.g., executed by task component 110 in FIG. 1) may be configured to generate tasks 221 to present to one or more users (e.g., user 208) via user interfaces (e.g., user interface 230). The tasks 221 may be determined based on the initial predictions and/or other information.

By way of non-limiting example, user interface 230 may be configured (e.g., via user interface component 112 in FIG. 1) to present labeling tasks 221 to the user 208, receive responses 222 corresponding to user input, communicate responses 222 back to the task generator subsystem 220 for use when generating and/or removing tasks, and/or other considerations.

In some implementations, responses 222 may facilitate a retraining process 240. The retraining process 240 may be configured to use responses 222 recorded during the labeling tasks 221 to generate a second machine learning prediction model and/or update the first machine learning prediction model 211. In some implementations, the processes described herein with reference to FIG. 2 may be iterated over one or more tasks 221 for one or more different labels and/or multimedia items to generate prediction models having a relatively higher degree of accuracy over time and/or to continually update an initial prediction model.

Returning to FIG. 1, the preprocessing component 108 may be configured to obtain one or more multimedia items from a repository 107 of one or more multimedia items. Individual ones of the multimedia items may include one or more of an image, a video, audio, a text file, and/or other considerations. The preprocessing component 108 may be configured to predict context information associated with individual ones of the multimedia items. In some implementations, predictions may be based on a machine learning prediction model, and/or other information. The preprocessing component 108 may be configured to determine confidence scores for individual ones of the predicted labels. A confidence score may correspond to a degree of certainty (or uncertainty) that the predicted labels accurately represent the associated multimedia item, and/or other considerations. A confidence score may be numerical (e.g., points, amount, score, rank, ratings, grades, or any other type of numerical value), descriptive (e.g., text for a confidence level, and/or other considerations), progressive (e.g., high confidence match, medium confidence match, low confidence match, and/or other considerations), pictorial (e.g., an image representing a confident facial expression, an image representing an unconfident facial expression, and/or other considerations), and/or any other type of value for a confidence score.
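
For concreteness, numerical confidence scores could be realized as softmax scores from a classifier. The sketch below assumes a simple linear model and a fixed label vocabulary; both are illustrative stand-ins for whatever machine learning prediction model an implementation actually uses.

```python
import numpy as np

LABELS = ["baseball", "dog", "beach", "concert"]  # illustrative vocabulary

def predict_labels(features, weights, top_k=2):
    """Return (label, confidence) pairs for one multimedia item, where
    `features` is a precomputed feature vector and `weights` is a
    (num_labels, feature_dim) matrix standing in for the model."""
    logits = weights @ features
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()  # softmax: confidences sum to 1
    top = np.argsort(scores)[::-1][:top_k]
    return [(LABELS[i], float(scores[i])) for i in top]

rng = np.random.default_rng(0)
print(predict_labels(rng.normal(size=128), rng.normal(size=(len(LABELS), 128))))
```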

In some implementations, the preprocessing component 108 may be configured to determine semantic representations 321 for multimedia items. By way of non-limiting example, a given semantic representation may be determined based on properties of a multimedia item, provided metadata and/or predicted metadata, and/or other information. In some implementations, the semantic representation may be used to determine semantic similarity between multimedia items. By way of non-limiting example, the semantic representation may comprise a discrete scalar value, a multi-dimensional quantity, a vector, a signal, a graphic representation, a text description, and/or other considerations. In some implementations, a semantic similarity score may be computed and/or determined from the semantic representations 321. In some implementations, pairwise similarity scores may be computed directly between pairs of multimedia items. Briefly, semantic representations 321 and/or similarity scores may be used to cluster multimedia items into groups based on semantic similarity and/or to retrieve semantically similar items from repository 107.
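
As a worked example of pairwise similarity scores, the sketch below uses cosine similarity between vector-valued semantic representations. The disclosure permits other representation forms (scalars, text descriptions, and so on), so the vector form here is an assumption.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise semantic similarity scores between all items, where each
    row of `embeddings` is one item's vector semantic representation."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard zero vectors
    return unit @ unit.T

rng = np.random.default_rng(1)
similarity = cosine_similarity_matrix(rng.normal(size=(5, 64)))
print(similarity.round(2))  # 5x5 symmetric matrix, ones on the diagonal
```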

In some implementations, a semantic similarity score may be numerical (e.g., points, amount, score, rank, ratings, grades, or any other type of numerical value), descriptive (e.g., text of a semantic description, and/or other considerations), progressive (e.g., high semantic match, medium semantic match, low semantic match, and/or other considerations), pictorial (e.g., an image and/or other considerations), and/or any other type of value for a semantic similarity score.

Semantic representations 321 may facilitate grouping two or more multimedia items together such that semantically and/or perceptually similar multimedia items may be grouped together based on a degree of similarity, and/or other considerations. Techniques from the fields of machine learning and/or data analysis may be used to automate the process of grouping similar items together (e.g., clustering). In some implementations, grouping may involve visually grouping multimedia items and/or representations of multimedia items by a visual distance (e.g., a smaller distance representing a closer semantic match), similarity in a metric space such as a visual-semantic embedding space, similarity in predicted tag or text embedding space, a matrix or “kernel” of pairwise similarity scores generated from a machine learning algorithm trained to predict pairwise item similarity, and/or other considerations. By way of non-limiting example, multimedia items including images of dogs may be grouped with multimedia items including images of other dogs; multimedia items including images of dogs of the same breed may be associated with another group; a multimedia item including an image of a dog may be grouped closer with a multimedia item including an image of a cat than with a multimedia item including an image of a car; and/or other considerations.
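
A deliberately simple grouping routine is sketched below: a greedy, threshold-based pass over a pairwise similarity matrix. A production system would more likely use k-means, agglomerative clustering, or another off-the-shelf algorithm, so treat this as illustrative only.

```python
import numpy as np

def greedy_cluster(similarity, threshold=0.5):
    """Group item indices whose pairwise similarity exceeds `threshold`."""
    groups, assigned = [], set()
    for i in range(len(similarity)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(similarity)):
            if j not in assigned and similarity[i, j] >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups

rng = np.random.default_rng(2)
vectors = rng.normal(size=(6, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
print(greedy_cluster(vectors @ vectors.T, threshold=0.2))
```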

As an illustrative example in FIG. 3, in some implementations, the preprocessing component 108 may include one or more of a metadata prediction component 310, a semantic similarity component 320, and/or other components. The metadata prediction component 310 may be configured to generate predicted metadata 311 and/or other information associated with one or more multimedia items. The predicted metadata 311 may include label predictions, prediction confidence scores, and/or other information. By way of non-limiting example, the metadata prediction component 310 may be configured to execute object recognition software communicating with a machine learning algorithm (e.g., managed by the learning component 114) to extract information from one or both of images and/or video to determine one or more labels to associate with the images and/or video, and/or other considerations. A machine learning algorithm (e.g., managed by the learning component 114) may include one or more of a convolutional neural network, a machine learning prediction model (e.g., first machine learning prediction model 211 in FIG. 2), and/or other information. Another implementation may employ manual tagging of metadata by users to bootstrap this step. By way of non-limiting example, object recognition software may recognize, within a multimedia item including an image, a person holding a baseball bat in a stance ready to swing. The metadata prediction component 310 may be configured to associate a label such as “baseball” with the multimedia item as a predicted label.

In some implementations, the semantic similarity component 320 may be configured to determine semantic representations 321 used for grouping multimedia items by similarity. By way of non-limiting example, multimedia items may be embedded into a semantic similarity space based on properties of the multimedia items, provided metadata and/or predicted metadata 311, and/or other information. In some implementations, pairwise similarity may be computed directly between pairs of multimedia items in the repository 107. Semantic representations 321 may be determined and/or used in other ways.

By way of non-limiting example, the preprocessing component 108 may be configured to obtain a first multimedia item and/or other multimedia items from the repository 107. The preprocessing component 108 may be configured to associate predicted first context information (e.g., a first label) with the first multimedia item. The preprocessing component 108 may be configured to associate predicted second context information (e.g., a second label) with the first multimedia item. The preprocessing component 108 may be configured to determine a first confidence score regarding the association of the first context information with the first multimedia item. The preprocessing component 108 may be configured to determine a first semantic representation associated with the first multimedia item. The first semantic representation may be used to determine a semantic similarity between the first multimedia item and at least one other multimedia item. The preprocessing component 108 may be configured to predict other information to associate with the first multimedia item.

Returning to FIG. 1, the task component 110 may be configured to generate one or more tasks 221 and/or sets of tasks 221 to be performed by one or more users via user interfaces. Tasks 221 may correspond to presenting users with options for adding context information (e.g., labels) to the metadata of a multimedia item, removing predicted context information from metadata, changing the predicted context information, confirming predicted context information associations, and/or providing other input, facilitated by a user interface. The task component 110 may be configured to select one or more labels or sets of labels from the labels predicted by the preprocessing component 108 to correspond with individual ones of the tasks 221.

As an illustrative example in FIG. 3, the task component 110 may include one or more of a sampling component 330, a sorting component 340, a clustering component 350, a chunking component 360, a consensus component 370, and/or other components. The sampling component 330 may be configured to select certain metadata and/or portions of metadata (e.g., a label, a group of labels, and/or other considerations) from the set of predicted metadata 311 to be associated with one or more tasks 221. The selection may be based on one or more of the predicted labels, confidence scores, semantic representations 321, semantic similarity scores, and/or other information. In some implementations, selection may be based on prioritizing tasks 221. By way of non-limiting example, metadata that was associated with one or more multimedia items with relatively low confidence may be included in tasks 221 that are prioritized over tasks 221 corresponding to metadata that was associated with one or more multimedia items with a higher degree of confidence, and/or other considerations.
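
The low-confidence-first prioritization might look like the following sketch; the mapping of labels to confidence scores is hypothetical.

```python
def build_task_queue(label_confidences):
    """Order labels into a task sequence, least confident first, so that
    the predictions most in need of human review are presented earliest."""
    return sorted(label_confidences, key=label_confidences.get)

queue = build_task_queue({"baseball": 0.51, "dog": 0.97, "beach": 0.33})
print(queue)  # ['beach', 'baseball', 'dog']
```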

In some implementations, the sampling component 330 may be configured to determine one or more multimedia items and/or crops/segments of one or more of the multimedia items whose predicted metadata 311 match the selected first portion of metadata.

In some implementations, the sorting component 340 may be configured to sort multimedia items into a list based on one or more sorting metrics. A sorting metric may be based on one or more of a time parameter (e.g., newest to oldest, and/or other considerations), priority (e.g., multimedia items associated with high priority tasks may be higher in the list than multimedia items associated with low priority tasks, and/or other considerations), predicted confidence (e.g., sorting from relatively low confidence scores to relatively high confidence scores), and/or other considerations.
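
Combining the named metrics into one ordering could be sketched as a composite sort key. The field names and the chosen precedence (priority, then confidence, then recency) are assumptions made for the example.

```python
def sort_items(items):
    """Sort multimedia items by task priority (highest first), then
    predicted confidence (lowest first), then timestamp (newest first)."""
    return sorted(items, key=lambda it: (-it["priority"],
                                         it["confidence"],
                                         -it["timestamp"]))

items = [
    {"name": "a.jpg", "priority": 1, "confidence": 0.9, "timestamp": 1700000000},
    {"name": "b.jpg", "priority": 2, "confidence": 0.4, "timestamp": 1690000000},
    {"name": "c.jpg", "priority": 2, "confidence": 0.2, "timestamp": 1710000000},
]
print([it["name"] for it in sort_items(items)])  # ['c.jpg', 'b.jpg', 'a.jpg']
```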

In some implementations, the clustering component 350 may be configured to associate multimedia items with groups based on the semantic representations 321. The groups may be associated with group tasks 221 that may reduce cognitive load for the user.

In some implementations, the chunking component 360 may be configured to associate one or more multimedia items and/or item groups in the list with individual tasks 221 to be presented to one or more users via respective user interfaces. The individual tasks 221 may be organized into a sequence of tasks.

In some implementations, after some time of user interaction with a user interface (e.g., described with reference to user interface component 112 herein), the consensus component 370 may be configured to determine if consensus has been reached among the responses 222 of the users for individual ones of the multimedia items and/or individual ones of the item groups associated with given tasks 221. Responsive to a consensus being reached, the task component 110 may be configured to remove the task(s) 221 corresponding to the multimedia item and/or item group from the sequence of tasks. Responsive to consensus not being reached, the task(s) 221 corresponding to the multimedia item and/or item group may be maintained in the sequence of tasks until consensus is reached. In some implementations, consensus may be based on a threshold number of users agreeing on the same or substantially the same metadata associations.

By way of non-limiting example, when multiple users are available, the task component 110 (e.g., the consensus component 370) may incorporate a consensus algorithm to determine when a particular task may be complete. In some implementations, the same task may be sent to multiple users, the responses 222 from each compared, and/or the algorithm may be used to determine that the task has sufficient consensus (agreement between users) and may be complete. In some implementations, the task component 110 may be configured to determine that consensus has not been reached and/or to send the task to one or more other users. These consensus determinations may also be used to prioritize tasks 221 that may be confusing to one or more users. By way of non-limiting example, the first context information may be associated with a first task. The first task may be presented to the first user, a second user, and/or other users. The first task may correspond to a presentation of the first context information, the first multimedia item, and/or other elements presented on a user interface. Responsive to determining a consensus between the first user and the second user regarding the association of the first context information with the first multimedia item, the first task may no longer be presented to the first user, second user, and/or other users.
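
A threshold-based consensus rule, as described above, could be as simple as the following sketch; the agreement threshold and the string-valued responses are illustrative assumptions.

```python
from collections import Counter

def has_consensus(responses, min_agreement=3):
    """True when at least `min_agreement` users gave the same labeling
    response for a task; otherwise the task stays in the sequence."""
    if not responses:
        return False
    _, count = Counter(responses).most_common(1)[0]
    return count >= min_agreement

print(has_consensus(["confirm", "confirm", "remove", "confirm"]))  # True
print(has_consensus(["confirm", "remove"]))                        # False
```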

Returning to FIG. 1, the user interface component 112 may be configured to effectuate presentation of user interfaces. The user interfaces may be configured to display one or more of the obtained multimedia items, the predicted context information associated with the one or more obtained multimedia items, and/or other display elements. By way of non-limiting example, a first user interface may be configured to display the first multimedia item, the predicted first context information associated with the first multimedia item, and/or other information.

In some implementations, the user interface component 112 may be configured to effectuate display of user interfaces, including elements such as an array of rendered multimedia items and/or representations of the multimedia items (e.g., a rendered image and/or a representation of the image such as a written description, and/or other considerations), interaction elements facilitating user entry and/or selection of addition, removal, change, and/or confirmation of a predicted label for generating labeling responses, and/or other elements facilitating user input.

In some implementations, a rendered multimedia item may comprise one or more of a visual summary (e.g., an image thumbnail, video storyboard, text data, audio signals, filename, and/or other properties visually denoting the multimedia item), displayed metadata (e.g., labels, confidence scores, semantic representations, semantic similarity scores, and/or other information), and/or other information. In some implementations, interaction elements for generating labeling responses may comprise one or more of an element for multimedia item selection (e.g., binary filtering), an element for threshold selection, an element for bounding box selection and/or adjustment, an element for continuous value refinement, an element for multiple choice (e.g., if the user is presented with a list of options, and/or other considerations), and/or other interaction elements.

In some implementations, the user interacts with a user interface and interaction elements to generate labeling responses. Interaction for different kinds of labeling tasks 221 may include binary labeling tasks, threshold selection, flipping the default, bounding box selection, ranking tasks, dynamic sorting, and/or other tasks.

Binary labeling tasks may correspond to user selection (or deselection) of individual multimedia items and/or other interface elements that the user judges to be relevant or not relevant to the label or other metadata associated with a task.

Threshold selection may correspond to user selection and/or adjustment of a position of a marker that indicates the separation between items that may be relevant and/or those that may not be relevant to the label associated with the task.

Flipping the default may correspond to a threshold selection task that may be combined with binary labeling. By way of non-limiting example, a default for unlabeled items may update dynamically, as a function of a current threshold setting. Items above the threshold may default to positive examples, and those below may default to negative examples. Unlabeled items may be labeled with a corresponding default accordingly. When the user selects and/or adjusts a threshold, items may be updated accordingly.
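
The flipping-the-default behavior might be implemented roughly as follows, assuming a confidence-sorted item list and hypothetical "user_label"/"label" fields; explicit user choices take precedence over the threshold-derived default.

```python
def apply_threshold_defaults(items, threshold_index):
    """Items sorted above the user-set threshold default to positive,
    items below it default to negative; explicit user labels win."""
    for position, item in enumerate(items):
        if item.get("user_label") is not None:
            item["label"] = item["user_label"]
        else:
            item["label"] = "positive" if position < threshold_index else "negative"
    return items

items = [{"user_label": None}, {"user_label": "positive"}, {"user_label": None}]
print(apply_threshold_defaults(items, threshold_index=1))
```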

Bounding box may correspond to a polygon and/or other complex multimedia item content refinement and/or adjustment. By way of non-limiting example, a bounding box may be a drawing superimposed over an image that may indicate a predicted dimension and/or position of contents relevant to the label associated with the task. Contents may correspond to one or more of detected objects, subjects, colors, environments, geolocations, and/or other considerations.

Ranking tasks may correspond to an iterative process in which a user may first select one or more examples of positive (highly-ranked) or negative (low-ranking) multimedia items. The task component 110 may be configured to update a task and/or the user interface component 112 may be configured to update a user interface by re-sorting the multimedia items according to a function that takes into account multimedia item semantic representations 321 determined during pre-processing.

Dynamic sorting may correspond to a collection of items that may be displayed in sorted order (e.g., as computed by the sorting component 340 of the task component 110 in FIG. 3). By way of non-limiting example, the sorting component 340 (FIG. 3) may be configured to facilitate updating a user interface (e.g., via the user interface component 112) dynamically in response to previous user responses 222. By way of non-limiting example, a first ordering may be a function of predicted metadata 311 confidence computed during pre-processing, and/or a second ordering may further incorporate a similarity function to boost (e.g., sorted with higher priority in the order) items similar to those just labeled by a user and/or demote (e.g., sorted with a lower priority in the order) items dissimilar to those just labeled by a user. In some implementations, multimedia items may be updated to be higher in the order based on output of a classifier that uses online learning to update its learned discriminative function in response to previous user responses 222.
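
One way to read the boost/demote behavior is as a score that mixes similarity to the just-labeled item with the original confidence-based ordering. The sketch below assumes unit-norm embeddings and a hypothetical mixing weight `alpha`; an online-learning classifier, as mentioned above, could replace the scoring function.

```python
import numpy as np

def resort_after_response(items, labeled_embedding, alpha=0.5):
    """Re-rank items after a user response: items similar to the one
    just labeled move up, and low-confidence items are surfaced sooner."""
    def score(item):
        similarity = float(item["embedding"] @ labeled_embedding)
        return alpha * similarity - (1 - alpha) * item["confidence"]
    return sorted(items, key=score, reverse=True)

rng = np.random.default_rng(3)
items = [{"embedding": v / np.linalg.norm(v), "confidence": c}
         for v, c in zip(rng.normal(size=(4, 8)), (0.2, 0.9, 0.5, 0.7))]
reordered = resort_after_response(items, items[0]["embedding"])
print([round(it["confidence"], 1) for it in reordered])
```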

FIG. 4 illustrates an exemplary user interface 400 configured to allow a user to carry out one or more labeling tasks, in accordance with one or more implementations. By way of non-limiting example, the user interface 400 may present a first task 402 allowing the user to confirm and/or remove an association of a predicted label (e.g., shown as “Label X” for illustrative purposes) with a multimedia item 474 (e.g., an image featuring multiple objects, and/or other considerations), and/or provide other input. The user interface 400 may display a second task 404 allowing the user to confirm or provide other input regarding whether an object (e.g., object “Z”) that is detected (e.g., via a bounding box 472) within the multimedia item 474 should be associated with the predicted label (e.g., Label X). The bounding box 472 may be adjustable via one or more toggles 473 and/or other user interface elements. By way of non-limiting example, the user may resize, move, and/or otherwise arrange the bounding box 472 such that an object or other content of the multimedia item 474 may be appropriately bound by the bounding box 472 to signify association with the current label (e.g., Label X). User input may be provided by “drag and drop” input, selection/deselection of check boxes, and/or other input techniques.

FIG. 6 illustrates an exemplary user interface 600 configured to allow a user to carry out one or more labeling tasks 221, in accordance with one or more implementations. In some implementations, the user may be presented with an array of multimedia items and/or representations of multimedia items in accordance with associating one or more of the items with a given label (e.g., Label X). In some implementations, some of the items shown in the grid may be relevant to a current task label (Label X) and/or some may not. Relevant images may be represented by some visual indicator (e.g., a tick mark, a check mark, a visual highlight, and/or other considerations). For clarity, the actions carried out by a user performing one or more tasks 221 presented in the user interface 600 are described by list elements A, B, C, and D shown in the user interface 600 under each of the presented tasks 221 (e.g., labeled “Task 1,” “Task 2,” and “Task 3”).

In some implementations, in a first task (e.g., labeled “Task 1”), the user may provide entry and/or selection (e.g., via a cursor, and/or other input mechanism) of a space between two images and/or sets of images to indicate where to place a threshold (e.g., depicted by the dashed vertical line). The user interface 600 may update (shown in the grid of images under “Task 2”) to flip the default below the user-set threshold to a negative label (shown by some visual indicator, such as a strike through the item, and/or other considerations). There may be one or more remaining items incorrectly labeled. In some implementations, under “Task 2,” the user may select one or more items to change their labels. This may result in a change in the visual indicator of the item (e.g., a change from a strike through the item to a check mark on the item, and/or other considerations). Under “Task 3,” the user may finalize their actions via entry and/or selection of “OK” to indicate completion, and/or other considerations.

In some implementations, an array of rendered multimedia items may be arranged as a list, grid, stack, objects arranged in 3-D space, and/or other considerations for displaying a collection of items.

In some implementations, multimedia items may be rendered in a user interface at a sufficient size so that content associated with the multimedia items may be distinguished, but small enough so that the user can view multiple items simultaneously.

Multimedia content with a time dimension (e.g., audio and/or video content) may be converted to a visual representation that may translate time into a visual dimension, while preserving the property that similar items appear visually similar. By way of non-limiting example, a visual representation of a time series of predicted labels may include a spectrogram-like visualization, and/or other considerations. By way of non-limiting example, for video multimedia items, a movie-strip-like visualization using automatic scene segmentation and/or selection of key frames may provide a visual representation of the video.

Returning to FIG. 1, the learning component 114 may be configured to generate a new machine learning prediction model and/or update an existing machine learning prediction model based on the user input corresponding to labeling responses. By way of non-limiting example in FIG. 3, once a task has reached consensus as determined by the consensus component 370, the responses 222 may be obtained by the learning component 114. The responses 222 may be used to train a new prediction model, to improve and/or update an existing prediction model, and/or other considerations. In some implementations, one or more operations of the learning component 114 may be carried out in an online fashion as tasks 221 reach consensus and/or in an offline fashion once a certain number of tasks 221 have been completed, considering all of the responses 222. This training (and/or retraining) process may additionally incorporate other metadata and/or information outside the task responses 222. By way of non-limiting example, information may be gathered from external resources (e.g., a third party webpage, and/or other resource). Data from external resources may be combined with the labels gathered from the labeling tasks 221. Aside from improving the prediction of metadata, the training (and/or retraining) process using responses 222 by the learning component 114 may be used to improve the similarity measures of the semantic similarity component 320 used by trained prediction models.

In some implementations, once training (and/or retraining) has reached a useful state, a machine learning prediction model may be used to generate new metadata predictions 311 through the metadata prediction component 310 during a post-processing procedure (e.g., carried out by the preprocessing component 108, and/or other components). The process may continue to produce new task responses 222 by users during labeling tasks 221, which may be used for training new models and/or retraining existing models. This process may repeat many times over until there is a global consensus on all metadata, a given prediction accuracy target has been reached, and/or a system provider chooses to stop the process.
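
The overall predict–verify–retrain cycle could be summarized as the loop below. `model.predict`/`model.fit` and `collect_responses` are placeholders for the prediction model interface and the user-facing labeling tasks, respectively, and the fixed round count stands in for the stopping conditions named above.

```python
def bootstrap_labels(model, repository, collect_responses, rounds=3):
    """Iterate the predict -> verify -> retrain cycle: each round predicts
    labels, gathers user-verified responses, and refits the model on the
    accumulated (increasingly clean) training data."""
    training_data = []
    for _ in range(rounds):
        predictions = [(item, model.predict(item)) for item in repository]
        verified = collect_responses(predictions)  # users add/remove/confirm
        training_data.extend(verified)
        model.fit(training_data)
    return model
```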

In FIG. 1, the server 102, user device(s) 104, repository 107, and/or external resources 105 may be operatively linked via one or more electronic communication links. For example, such electronic communication links may be established, at least in part, via a communication network 106 such as the Internet and/or other networks. It will be appreciated that this is not intended to be limiting, and that the scope of this disclosure includes implementations in which server 102, user device(s) 104, repository 107, and/or external resources 105 may be operatively linked via some other communication media.

The external resources 105 may include sources of information, hosts and/or providers of information outside of system 100, external entities participating with system 100, and/or other resources. In some implementations, some or all of the functionality attributed herein to external resources 105 may be provided by resources included in system 100 (e.g., in server 102).

The server 102 may include electronic storage 116, one or more processors 118, and/or other components. The server 102 may include communication lines or ports to enable the exchange of information with a network and/or other computing platforms. Illustration of server 102 in FIG. 1 is not intended to be limiting. The server 102 may include a plurality of hardware, software, and/or firmware components operating together to provide the functionality attributed herein to server 102.

Electronic storage 116 may comprise electronic storage media that electronically stores information. The electronic storage media of electronic storage 116 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with server 102 and/or removable storage that is removably connectable to server 102 via, for example, a port or a drive. A port may include a USB port, a FireWire port, and/or other port. A drive may include a disk drive and/or other drive. Electronic storage 116 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. The electronic storage 116 may include one or more virtual storage resources (e.g., cloud storage, a virtual private network, and/or other virtual storage resources). Electronic storage 116 may store software algorithms, information determined by processor 118, information received from server 102, and/or other information that enables server 102 to function as described herein.

Processor(s) 118 is configured to provide information processing capabilities in server 102. As such, processor 118 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor 118 is shown in FIG. 1 as a single entity, this is for illustrative purposes only. In some implementations, processor 118 may include one or more components. These components may be physically located within the same device, or processor 118 may represent processing functionality of a plurality of devices operating in coordination. The processor 118 may be configured to execute components 108, 110, 112, and/or 114. Processor 118 may be configured to execute components 108, 110, 112, and/or 114 by software; hardware; firmware; some combination of software, hardware, and/or firmware; and/or other mechanisms for configuring processing capabilities on processor 118.

It should be appreciated that, although components 108, 110, 112, and/or 114 are illustrated in FIG. 1 as being co-located within a single component, in implementations in which processor 118 includes multiple components, one or more of components 108, 110, 112, and/or 114 may be located remotely from the other components. The description of the functionality provided by the different components 108, 110, 112, and/or 114 described above is for illustrative purposes and is not intended to be limiting, as any of components 108, 110, 112, and/or 114 may provide more or less functionality than is described. For example, one or more of components 108, 110, 112, and/or 114 may be eliminated, and some or all of its functionality may be provided by other ones of components 108, 110, 112, 114, and/or other components. As another example, processor 118 may be configured to execute one or more additional components that may perform some or all of the functionality attributed below to one of components 108, 110, 112, and/or 114.

FIG. 5 illustrates a method 500 of associating context information with multimedia items, in accordance with one or more implementations. The operations of method 500 presented below are intended to be illustrative. In some embodiments, method 500 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 500 are illustrated in FIG. 5 and described below is not intended to be limiting.

In some embodiments, method 500 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, a functionally limited processing device, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of method 500 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 500.

Referring now to method 500 in FIG. 5, at an operation 502, one or more multimedia items may be obtained from a repository 107 of multimedia items. In some implementations, operation 502 may be performed by a preprocessing component the same as or similar to preprocessing component 108 (shown in FIG. 1 and described herein).

At an operation 504, predicted context information may be associated with individual ones of the multimedia items. The context information may take the form of a label, a tag, and/or other information associated with the multimedia items. Context information may be stored as metadata associated with the multimedia items. In some implementations, operation 504 may be performed by a preprocessing component the same as or similar to the preprocessing component 108 (shown in FIG. 1 and described herein).

At an operation 506, user interfaces may be presented to users on associated user devices. The user interfaces may be configured to display one or more of the obtained multimedia items and/or the corresponding predicted context information for the multimedia items. The user interfaces may facilitate users carrying out one or more labeling tasks 221, and/or other operations. In some implementations, operation 506 may be performed by a user interface component the same as or similar to the user interface component 112 (shown in FIG. 1 and described herein).
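One non-limiting way operation 506 might prepare items and their predicted labels for rendering is sketched below. The view-model structure and the task vocabulary are illustrative assumptions, not the disclosed user interface component 112.

    def build_labeling_view(items, task):
        """Operation 506 (sketch): assemble a view model that a user interface
        could render as an array of multimedia items for one labeling task.

        The returned structure and the task names are illustrative only.
        """
        return {
            "task": task,  # e.g., "add", "remove", "change", or "confirm"
            "items": [
                {
                    "item_id": item.item_id,
                    "thumbnail": item.uri,
                    "predicted_labels": item.metadata.get("predicted_labels", {}),
                }
                for item in items
            ],
        }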

At an operation 508, entry and/or selection of user input may be obtained from the users. Entry and/or selection of user input may be accomplished via interaction elements of the user interfaces. The user input may be related to the association of the predicted context information with the multimedia items displayed in the user interfaces. By way of non-limiting example, the users may provide input related to adding context information associations, removing predicted context information associations, changing predicted context information associations, confirming predicted context information associations, and/or other input. In some implementations, operation 508 may be performed by a user interface component the same as or similar to the user interface component 112 (shown in FIG. 1 and described herein).
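A hedged sketch of operation 508, continuing the MultimediaItem type from above and assuming user responses arrive as simple dictionaries. The response shape, the action names, and the feedback tuple are illustrative assumptions; the feedback list stands in for the prediction feedback that would be provided to the neural network.

    def apply_user_responses(items_by_id, responses):
        """Operation 508 (sketch): fold user input back into item metadata and
        collect feedback that could be used to update the prediction model.

        Each response is assumed to be a dict such as
        {"item_id": ..., "action": "add" | "remove" | "change" | "confirm",
         "label": ..., "new_label": ...}; this shape is an illustrative
        assumption.
        """
        feedback = []
        for response in responses:
            item = items_by_id[response["item_id"]]
            labels = item.metadata.setdefault("labels", set())
            action = response["action"]
            if action in ("add", "confirm"):
                labels.add(response["label"])
            elif action == "remove":
                labels.discard(response["label"])
            elif action == "change":
                labels.discard(response["label"])
                labels.add(response["new_label"])
            feedback.append((response["item_id"], action, response.get("label")))
        return feedback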

Although the present technology has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementation, it is to be understood that such detail is solely for that purpose and that the technology is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present technology contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.

CLAIMS

1.-20. (canceled)
21. A system for content labeling of multimedia items comprising:
a computer system comprising one or more processors programmed with computer program instructions that, when executed, cause the computer system to perform operations comprising:
determining a set of multimedia items based on multimedia items and corresponding labels predicted by an instance of a neural network;
causing a first graphical representation of the multimedia items to be presented as a first array of multimedia items on a user interface, the first array of multimedia items arranged for a user to complete a first task;
causing a second graphical representation of the multimedia items to be presented as a second array of multimedia items on the user interface, the second array of multimedia items arranged for a user to complete a second task, wherein the first array of multimedia items and the second array of multimedia items are presented at a same time on the user interface;
obtaining, via the user interface, at least one user indication for at least one of the first array of multimedia items and the second array of multimedia items, wherein the user indication is related to at least one of where to place a threshold and whether to change a label; and
providing, to the neural network, the at least one user indication to cause the neural network to be updated based on the at least one user indication.
22. The system of claim 21, wherein the operations further comprise providing, to at least one other neural network, the at least one user indication to cause the at least one other neural network to be updated.
23. The system of claim 21, wherein at least one of the first and the second array of multimedia items is arranged for display in a list, grid, stack, or objects arranged in 3-D space.
24. The system of claim 21, wherein at least one of the first and the second array of multimedia items is rendered for display in the user interface at sufficient size so content associated with each multimedia item may be distinguished, but small enough so the user can view multiple multimedia items simultaneously.
25. The system of claim 21, wherein the operations further comprise: obtaining, via the user interface, at least one user change of a label of a multimedia item of at least one of the first array and the second array of multimedia items; and providing, to the neural network, the at least one change of the label to cause the neural network to be updated.
 26. The system of claim 21, wherein the operations further comprise, after determining the set of multimedia items based on multimedia items and corresponding labels predicted by an instance of the neural network: converting multimedia items having a time dimension into a visual representation while preserving a property enabling such multimedia items to appear visually similar to multimedia items not having a time dimension.
27. A method for content labeling of multimedia items comprising:
determining, by a computer system, a set of multimedia items based on multimedia items and corresponding labels predicted by an instance of a neural network;
causing, by the computer system, a first graphical representation of the multimedia items to be presented as a first array of multimedia items on a user interface, the first array of multimedia items arranged for a user to complete a first task;
causing, by the computer system, a second graphical representation of the multimedia items to be presented as a second array of multimedia items on the user interface, the second array of multimedia items arranged for a user to complete a second task, wherein the first array of multimedia items and the second array of multimedia items are presented at a same time on the user interface;
obtaining, by the computer system via the user interface, at least one user indication for at least one of the first array of multimedia items and the second array of multimedia items, wherein the user indication is related to at least one of where to place a threshold and whether to change a label; and
providing, by the computer system to the neural network, the at least one user indication to cause the neural network to be updated based on the at least one user indication.
28. The method of claim 27, further comprising providing, by the computer system to at least one other neural network, the at least one user indication to cause the at least one other neural network to be updated.
29. The method of claim 27, wherein at least one of the first and the second array of multimedia items is arranged for display in a list, grid, stack, or objects arranged in 3-D space.
30. The method of claim 27, wherein at least one of the first and the second array of multimedia items is rendered for display in the user interface at sufficient size so content associated with each multimedia item may be distinguished, but small enough so the user can view multiple multimedia items simultaneously.
31. The method of claim 27, further comprising, after determining the set of multimedia items based on multimedia items and corresponding labels predicted by an instance of the neural network: converting, by the computer system, multimedia items having a time dimension into a visual representation while preserving a property enabling such multimedia items to appear visually similar to multimedia items not having a time dimension.
32. The method of claim 27, further comprising: obtaining, by the computer system via the user interface, at least one user change of a label of a multimedia item of at least one of the first array and the second array of multimedia items; and providing, by the computer system to the neural network, the at least one change of the label to cause the neural network to be updated.
33. A non-transitory computer-readable media comprising instructions that, when executed by at least one processor, cause operations comprising:
determining a set of multimedia items based on multimedia items and corresponding labels predicted by an instance of a neural network;
causing a first graphical representation of the multimedia items to be presented as a first array of multimedia items on a user interface, the first array of multimedia items arranged for a user to complete a first task;
causing a second graphical representation of the multimedia items to be presented as a second array of multimedia items on the user interface, the second array of multimedia items arranged for a user to complete a second task, wherein the first array of multimedia items and the second array of multimedia items are presented at a same time on the user interface;
obtaining, via the user interface, at least one user indication for at least one of the first array of multimedia items and the second array of multimedia items, wherein the user indication is related to at least one of where to place a threshold and whether to change a label; and
providing, to the neural network, the at least one user indication to cause the neural network to be updated based on the at least one user indication.
34. The non-transitory computer-readable media of claim 33, comprising further instructions that, when executed by the at least one processor, cause providing, to at least one other neural network, the at least one user indication to cause the at least one other neural network to be updated.
35. The non-transitory computer-readable media of claim 33, wherein the instructions for causing the first graphical representation of the multimedia items to be presented as a first array of multimedia items on the user interface and for causing the second graphical representation of the multimedia items to be presented as a second array of multimedia items on the user interface cause an array of rendered multimedia items arranged as at least one of a list, grid, stack, or objects arranged in 3-D space.
 36. The non-transitory computer-readable media of claim 33, wherein the instructions for causing the first graphical representation of the multimedia items to be presented as a first array of multimedia items on the user interface and for causing the second graphical representation of the multimedia items to be presented as a second array of multimedia items on the user interface cause rendering for display in the user interface at sufficient size so content associated with each multimedia item may be distinguished, but small enough so the user can view multiple multimedia items simultaneously.
37. The non-transitory computer-readable media of claim 33, comprising further instructions that, when executed by the at least one processor, cause operations comprising: obtaining, via the user interface, at least one user change of a label of a multimedia item of at least one of the first array and the second array of multimedia items; and providing, to the neural network, the at least one change of the label to cause the neural network to be updated.
38. The non-transitory computer-readable media of claim 33, comprising further instructions that, when executed by the at least one processor after determining the set of multimedia items based on multimedia items and corresponding labels predicted by an instance of the neural network, cause operations comprising: converting multimedia items having a time dimension into a visual representation while preserving a property enabling such multimedia items to appear visually similar to multimedia items not having a time dimension.